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Abstract 

Background: Identifying recombinations in HIV is important for studying the epidemiology of the virus and aids in 
the design of potential vaccines and treatments. The previous widely-used tool for this task uses the Viterbi 
algorithm in a hidden Markov model to model recombinant sequences. 

Results: We apply a new decoding algorithm for this HIVIIVI that improves prediction accuracy. Exactly locating 
breakpoints is usually impossible, since different subtypes are highly conserved in some sequence regions. Our 
algorithm identifies these sites up to a certain error tolerance. Our new algorithm is more accurate in predicting 
the location of recombination breakpoints. Our implementation of the algorithm is available at http://www.es. 
uwaterloo.ca/~jmtruszk/jphmm_balls.tar.gz. 

Conclusions: By explicitly accounting for uncertainty in breakpoint positions, our algorithm offers more reliable 
predictions of recombination breakpoints in HIV-1. We also document a new domain of use for our new decoding 
approach in HIVlMs. 



Background 

We consider the problem of locating recombination 
breakpoints in viral genomes. The Human Immunodefi- 
ciency Virus has been classified into several major phy- 
logenetic groups called subtypes [1]. Recombination 
between those groups is common [2], so a viral genome 
can be composed of several regions arising from differ- 
ent subtypes. Given a viral genome sequence, we would 
like to identify whether it is a recombinant, and which 
subtypes gave rise to which regions of its sequence. 

Identifying recombinations is important for epidemio- 
logical monitoring of the virus, as well as for the design 
of potential vaccines and treatments. For example, some 
subtypes develop more resistance to anti-retroviral 
drugs than others [3]. Accurate identification of recom- 
bination patterns is also crucial to detecting superinfec- 
tions in patients [4]. 

One current method for detecting recombinations 
uses profile HMMs; we describe others in the next sec- 
tion. Profile HMMs are a widely used tool in modelling 
families of sequences and can be thought of as a way of 
summarizing multiple sequence alignments of a 
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sequence family [5]. A profile HMM is composed of 
three sequences of match, insert and delete states. 
Match states correspond to consensus columns in the 
multiple alignment. Each match state can make a transi- 
tion to a corresponding insert state, which models resi- 
dues appearing between consensus columns. Finally, a 
chain of silent delete states accounts for the possibility 
of a sequence skipping some consensus columns in the 
alignment. Figure 1 shows an example of a profile 
HMM. Profile HMMs are derived from multiple 
sequence alignments using standard algorithms [6]. The 
jumping profile hidden Markov model (jpHMM) [7], as 
shown in Figure 2, represents each HIV subtype as a 
profile HMM derived from an alignment of sequences 
from this subtype. Transitions between each pair of sub- 
types are added in each column to model possible 
recombinations between subtypes. These transitions are 
called jumps and are assigned very low probability, as 
the number of recombinations in a sequence is low 
compared to its total length. Predicting recombinations 
is achieved using the classical Viterbi algorithm (see e.g. 
[6]). Schultz et al. [7,8] show that using the jpHMM 
with the Viterbi algorithm to predict recombinations 
outperforms Simplot, then the most commonly used 
recombination detection tool. 
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Figure 1 A profile liidden Marl<ov model A profile HMM for an 
alignment with 5 consensus columns. Rectangles represent match 
states. Diamonds represent insert states. Circles represent delete 
states as well as (silent) begin and end states. 



We apply a novel algorithm to further improve the 
accuracy of predictions provided by a jpHMM. Our key 
observation is that in many cases, high similarity 
between different subtypes makes it impossible to pin- 
point breakpoint positions. We focus on approximately 
predicting breakpoint positions rather than maximizing 
the probability of being exactly correct. We have 
demonstrated the efficiency of a similar decoding strat- 
egy for transmembrane protein topology prediction in 
previous work [9], where we also developed much of 
our algorithmic framework. 

Related work 

Many tools have been developed for HIV recombination 
detection. These tools can be divided into two groups: 
those based solely on comparisons between sequences 




Figure 2 A jumping profile hidden Markov model A jumping 
profile hidden IVlarkov model for 2 subtypes. States sharing the 
same label are coloured with the same colour. Red arrows indicate 
jump transitions between subtypes. For clarity, jumps involving 
insert or delete states have been omitted. 



and those which are additionally based on phylogenetic 
reconstruction. 

Among the tools based solely on sequence compari- 
son, Simplot [10] is perhaps the most well-known. It 
provides a graph showing the local similarity of a query 
sequence to a set of reference sequences. Another tool, 
RIP [11] also operates on the same principle. The jump- 
ing profile hidden Markov model (jpHMM) [7] uses pro- 
file hidden Markov models to model different subtypes 
and predict recombinations. Here, comparisons are 
made between the query sequence and profile HMMs 
describing collections of sequences belonging to the 
same subtype. 

Tools using phylogenetic information try to detect 
recombinations by reconstructing phylogenies for differ- 
ent regions of sequence alignment and comparing them 
to detect topology changes, which suggest recombina- 
tion took place. The REGA HIV subtyping tool [12] 
uses a sliding window approach to reconstruct phyloge- 
nies for different regions in the alignment, though it 
only identifies the subtypes without predicting break- 
point positions. DualBrothers [13] and cBrother [14] 
divide the sequence alignment into segments with differ- 
ent evolutionary parameters and different phylogenetic 
histories. They then use Markov Chain Monte Carlo to 
sample from the posterior distribution of such partitions 
and their corresponding trees. CARD [15] uses a genetic 
algorithm to find a maximum-likelihood partition of the 
alignment into segments with different evolutionary his- 
tories. De Oliveira Martins and Kishino [16] try to 
account for uncertainty of segment assignments by 
using a distance measure between mosaic structures and 
applying a centroid estimator. 

Several phylogeny-based tools use phylogenetic HMMs 
(phylo-HMMs) to model topology changes along the 
sequence alignment. Each state in a Phylo-HMM corre- 
sponds to a phylogenetic tree and emits a column of the 
multiple sequence alignment. Husmeier and Wright [17] 
use a phylogenetic HMM with three states to detect 
recombinations between four sequences. Kedzierska and 
Husmeier [18] propose a heuristic technique using an 
HMM that emits tree topologies. Westesson and 
Holmes [19] use a version of the EM algorithm to train 
a phylogenetic HMM with a fixed, user-defined number 
of tree topologies. Webb, Holmes and Hancock [20] 
integrate out the branch lengths in the phylogenies to 
speed up the MCMC inference. 

We note that the HMMs used in these phylogeny- 
based approaches are very different from the jpHMM 
model we employ. A state in a phylo-HMM emits a col- 
umn in the multiple sequence alignment of the analyzed 
sequences. In contrast, a jpHMM is a single-sequence 
HMM where each state emits a base of the sequence 
under analysis. While phylogeny-based methods take a 
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multiple sequence alignment as an input, the jpHMM's 
input is only the single recombinant sequence to be ana- 
lyzed: the multiple alignments already built are encapsu- 
lated in the probabilistic properties of the profile HMM. 
A state path through the jpHMM corresponds to an 
assignment of regions of the input sequence to the sub- 
types modelled by the profile HMMs. 

Phylogeny-based methods provide more flexibility in 
modelling sequence evolution, but they often require 
the knowledge of evolutionary parameters to work prop- 
erly and are computationally expensive. In contrast, pro- 
file-based approaches are simpler. They do require 
parameters that summarize existing multiple alignments, 
though this is perhaps easier data to infer than are evo- 
lutionary rates. Profile methods are also typically faster, 
as they do not require the evaluation of tree likelihoods 
for a number of different topologies; their computation 
is also accelerated using heuristics such as beam search 
[21]. Phylogenetic approaches which do not enumerate 
over many tree topologies, but instead efficiently com- 
pute the optimal ones, could in principle give fast run- 
times, but have not been used widely for HIV. 

The jpHMM does not use any phylogenetic informa- 
tion, but it differs from other phylogeny-free approaches 
in that it compares the input sequence to profile HMMs 
rather than single sequences representing different sub- 
types. Here, we focus on improving the prediction accu- 
racy of jpHMM by applying a new decoding algorithm 
to infer recombination breakpoints. 

Recently, some other approaches have also been pro- 
posed to account for uncertainty of boundary locations 
in recombination detection. Schultz et al. [8] detect 
uncertainty regions in the Viterbi prediction where the 
posterior probability of the predicted subtype is below a 
certain threshold. Nanasi et al. [22] devised a novel 
decoding algorithm for HMMs which maximizes the 
expected number of breakpoints predicted up to a certain 
error tolerance, but does not assess the joint probability 
of all breakpoints; more detail is in Nanasi's thesis [23]. 

These methods have drawbacks. The algorithm by 
Nanasi et al. is too computationally expensive to be 
practical on full-length viral sequences and has only 
been evaluated on a small region of HIV alignment. The 
newer approach of Schultz et al. [8] still uses the Viterbi 
algorithm to locate breakpoints, although it gives more 
insight into their credibility. 

While our approach bears some resemblance to the 
two methods mentioned above, it differs from both in 
that it optimizes a global objective while allowing for 
small local errors in our prediction. 

Our approach 

Schultz et al. [7] use the classical Viterbi algorithm for 
jpHMM decoding, despite its many drawbacks. The 



Viterbi path may have extremely small probability and is 
unlikely to be correct overall for sequences having more 
than a few elements. Also, we are interested in finding a 
correct labelling, rather than an optimal state path in 
the model. In many situations, likely labellings consist of 
a large number of low-probability paths. In such cases, 
the Viterbi algorithm chooses the labelling that corre- 
sponds to the most likely state path, but the posterior 
probability of such labelling may be much lower than 
the posterior probability of a labelling composed of 
many low-probability paths. This phenomenon is known 
as the path/labelling problem [24]. Even in the absence 
of the path/labelling problem, the Viterbi path is unli- 
kely to be exactly correct in cases where there is much 
uncertainty about exact border positions. Many different 
labellings will then have similar probabilities, and the 
most probable labelling will not necessarily be correct. 

In our previous work on robust HMM decoding [9], 
we ameliorate some of these problems by searching for 
the centre of the most probable ball of predictions 
rather than a single prediction. A ball of radius r 
centred at a specific labelling A for the sequence con- 
tains a number of paths with labellings at distance less 
than an upper bound r from A. The number of such 
paths grows exponentially with r. We often find balls of 
large posterior probability, giving us more confidence 
that the centre of the ball is close to the true labelling. 
We define balls on labellings rather than paths, so the 
path/labelling problem is avoided. 

Our ball algorithms, with our border shift distance 
(defined in the next section), are a natural choice for 
the problem of recombination prediction. Our priority 
in this problem is accurately predicting which subtypes 
have occured in our sequence. Since subtypes are highly 
similar in many places, exactly predicting breakpoints 
appears impossible, meaning that we should instead 
focus on approximately predicting breakpoints. Where 
subtype similarity is low, we obtain quite accurate pre- 
dictions by shrinking the ball radius. 

Preliminaries 

A labelled HMM is defined by a transition matrix A, an 
emission matrix E, and a labelling function /. The matrix 
A is a stochastic m x m transition matrix where fl,y gives 
the probability of transition from state i to state ;'; E is 
an OT X |Z| emission matrix where e/^cr gives the prob- 
ability of emitting symbol a e Z in state k. The labelling 
function / assigns to each state a single label from the 
set A of possible labels. Many states often share a single 
label. We assume that the model starts in a state Xq. 

In the context of recombination detection with jpHMM, 
each state corresponds to a column in the alignment of 
sequences from a particular subtype. For each subtype and 
each consensus column in the alignment of the subtype. 
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we have a match, insert, and a delete state. Each state is 
labelled by the subtype it belongs to. An HMM con- 
structed this way has a large number of states: around 3 • 
9000 • 14 = 378000, where the factor of 9000 is the num- 
ber of consensus columns in each subtype alignment, and 
14 is the number of subtypes modelled by the jpHMM. 
This number of subtypes is higher than the usual number 
of HIV subtypes reported in the literature [1] because the 
creators of the jpHMM-HIV chose to model the sub-sub- 
types Al, A2, Fl and F2, and the CRF 01_AE, as well as 
HIV group O and the Simian Immunodefficiency Virus, as 
separate subtypes [7]. The number of labels is 14, one for 
each subtype. 

A path in an HMM is a sequence of states x = Xq, Xi, . 
. ■ , x,j visited by the model at each step. We denote the 
corresponding sequence of emissions as ji, . . . , y„. The 
labelling X corresponding to path x is defined as A = 
t{xo), (,{xi), £(x2), ■ ■ ■ , i(x„). For a generic sequence Zi, . 
. . Zn, we write z/.to denote the subsequence z„ . . . , Zj. 
We write z| = u to denote Zi, . . . , Zj = u, . . . , u. 

Since a typical sequence contains only a few breakpoints, 
a labelling for a sequence will usually have many consecu- 
tive positions with the same label. Given a labelling A = 
i{xo),fi,fi, ■ ■ ■ ./i>/2>/2. . . . ,/2, • ■ ■ ,fh ■■ - ifh which con- 
sists of Ao followed by a number of positions labelled/i, 
then a number of positions labelled /2 (which is different 
from/i), and so on, we define its footprint to be the 
sequence/ = /i, . ■ . ,fk; this corresponds to the overall 
labelling of the sequence, but with the label boundaries 
entirely flexible. Typically, regions with the same label cor- 
respond to features: in our application, they are sequence 
intervals derived from a single subtype. Footprints are 
typically much shorter than a sequence of emissions. 

Consider two different labellings A and A' for the same 
sequence y. We define their border shift distance d{k, X') 
as a function of the feature boundaries, as follows. 

First, if A and A' have different footprints, they are 
incompatible; their distance is oo. Otherwise, let/ = /, . . 
. , fk he their common footprint, and let ^;(A) be the 
position in A corresponding to the first emission from 
the interval labelled /+i. The border shift distance 
between A and A' is then max; = i ^-i \bi{X)-bi{X')\ it is 
the maximum boundary shift between the correspond- 
ing features in two labellings. Figure 3 shows how we 
compute this distance. 

A ball of radius r centered at a labelling A, denoted as 
fi(A, r), is defined as the set of all labellings A' having 
distance at most r from A. 

Methods 

Computing ball probabilities 

Our first algorithm computes, for a given labelling A, the 
probability of a ball of radius r centered at A. We use a 
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Figure 3 The position of borders in labellings. The /th border for 
a labelling A, b,U), is the first position in A with the label /j+i. The 
shift in the /th border, between two labellings X and A' that share 
the same footprint f is |b,(A,) - fa,(A)|. 



variant of the traditional forward algorithm for HMMs 
[6], which computes the probability that an HMM gen- 
erates a sequence y, in 0{nm^) time. 

Let f = fl, . . . , fk he the footprint of A. We create k 
groups of states Gi, . . . , G^: each corresponding to the 
label set Lf., of states with label/, for each position in 
the footprint. Each state in Lf. is represented in G,. 
States in the new model have exactly the same emission 
probabilities as in M, but we may only make transitions 
from states in G; to those in G, or G,+i, with the same 
transition probabilities as in M. (To make a proper 
HMM, we can create a "dump" state for paths in M that 
do not respect the footprint we seek, or not bother.) See 
Figure 4. 

We also restrict the set of possible states visited at 
each position of the sequence. We require that transi- 
tion from the initial state has to be made to one of the 
states in Gi. The other entries in the first column of the 
forward table will be set to 0. Likewise, states from 
group G, can only be visited at positions bi^i - r to hi + 
r -1 - to ensure that the shift from G, to G/+i occurs 
between bi - r and bi + r. Finally, we require that the 
state at the last position of the sequence is in group G^- 
The standard forward algorithm computes the probabil- 
ity of B(A, r) on this HMM. 




Figure 4 Computing the probability of a footprint. To compute 
the probability of a footprint f = fi, . . . , 4, we create a group states 
G, for each entry in f, corresponding to the states in M labelled /j. A 
path compatible with the footprint must first go from Xq to one of 
the states in G,, then eventually go to G2, and on to G^. Transitions 
are only allowed from G,- to G,- or G,.,.,. 
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Let L be the size of the largest label set. The runtime 
of this algorithm is O (L^ Yl^i tij'where tj is the number 
of allowed labels at the i'-th position of the sequence. 
For r = 0, which corresponds to computing the prob- 
ability of a single labelling, we only allow 1 label at each 
position, so computing the probability of a labelling 
requires 0(L n) time. Now observe that increasing r by 
1 increases the overall number of active labels by at 
most 2k. This gives us the runtime of 0{L^(n + kr)). In 
the case of jpHMMs for HIV recombination detection, 
this runtime is small compared to the standard forward 
algorithm because L = w /14 (there are 14 subtypes and 
sub-subtypes represented in the jpHMM for HIV) and 
both k and the chosen value of r are small compared to 
n. 

Optimization 

Finding the most probable ball of radius r is NP-hard 
even when we restrict the centres to being consistent 
with a fixed footprint [9]. In practice, however, we can 
find high-probability balls efficiently by sampling from 
the distribution of state paths and using local search. If 
there exists a high-probability ball in the space of paths, 
we likely sample a path within that ball; then, the local 
search procedure moves us towards the centre of the 
ball. We used this method to predict transmembrane 
protein topology [9]. Unfortunately, our previous local 
search algorithm requires performing many forward and 
backward passes, which makes it too computationally 
expensive for use with jpHMMs. Here, we present a 
more efficient local search algorithm that exploits the 
fact that the distances between breakpoints are large. 

If the size of the footprint is 2 (only one breakpoint), 
we can recover the most probable ball of radius r by 
doing exhaustive search. We could calculate the ball 
probability for every possible placement of the break- 
point, for every choice of two recombining subtypes. 
This requires (« - 1|A| ball computations, or a total 
runtime of 0{n L |A| ), which is too large. 

We now reduce this runtime. For any HMM M, let 
(j)i,hik] = Pr[yi, . . . ,Yi,X\ = h,Xi = k\ be the probability 
that M, started at the initial state Xq, emits yi, . . . , yi 
while visiting only states with label h and that Xt = k. 
Obviously, (Pi,hU() = 0 if the label of state k is not h. 
Analogously, let Pi,j{k) = Pr[yi+i...y„, Aj'^^j = j.Xi = k] be 
the probability that M, started at state k, emits jy^+i . . . 
j„ while visiting only states labelled with /. For a label- 
ling A(/z, /, / + 1) consisting of two intervals labelled h 
and / with a breakpoint at position / + 1, the probability 
of /{.(h, j, i + 1) is the dot product of (pi^p, and ^ij. We 
obtain all forward vectors 9, ;, by running the algorithm 
for computing the probability of a labelling for a label- 
ling where all positions are labelled with h. Similarly, we 



obtain the backward vectors fitj by running a similar 
backward algorithm for a labelling consisting only of 
Once we have all the vectors, we can find the most 
probable ball by a linear scan. The overall runtime of 
this procedure amounts to 0{nL |A| ). 

In many cases, this runtime can be further improved if 
we know the location of some reference ball 5(?l', r) 
with posterior probability over 0.5. Since the path prob- 
abilities add up to 1, we know that the maximum prob- 
ability ball must intersect 5(X', r) The centres of 
intersecting balls must have distance at most 2r, and the 
footprints must be the same, so we are left with only 4r 
+ 1 candidate ball centres as opposed to « -1. We need 
to compute only around 6r dot products, giving runtime 
0{nL^). The asymptotic runtime is dominated by the 
forward and backward passes and not dependent on r 
since computing each dot product takes 0{L) time, so 
computing dot products at all positions would take O 
(nL) time. Nevertheless, considering only 6r positions 
yields a substantial practical improvement in the con- 
stant factor. 

The initial guess about the location of the reference 
ball can be obtained by sampling several paths from the 
conditional distribution Pr[x"|y"] using standard algo- 
rithms [25,26] and computing the probability of a ball 
containing all (or most) of the sampled paths. 

Similar exhaustive search procedures quickly become 
impractical as the number of breakpoints grows. How- 
ever, we can use this approach as a heuristic to optimize 
each breakpoint separately. If distances between subse- 
quent breakpoints are large, compared to the ball radius, 
there may be very little dependence between subsequent 
breakpoints, so optimizing for each breakpoint sepa- 
rately yields a good approximation to the globally opti- 
mal ball centre. For the first breakpoint, computing the 
probability of the breakpoint occuring at every position 
is carried out as in the one-breakpoint case except that 
we assume that the borders 2 to k occur within a refer- 
ence ball for which backward table values were pre- 
viously calculated. As we move to the next breakpoint, 
we calculate the forward values based on the optimal 
location of previous breakpoints, while the backward 
vectors still come from the calculation for the reference 
ball. This way, our algorithm only requires performing 
one additional forward pass after the backward pass for 
the reference ball has been performed. 

Our complete optimization algorithm is as follows. 
First, we sample 10 paths from the HMM to determine 
a reference ball for our local search. We determine the 
majority footprint in the sample set and discard samples 
with a different footprint. We estimate the mean and 
variance of each breakpoint position from the sample. 
From this estimate, we derive a biased estimate of the 
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Standard deviation which we multiply hy-^/j to get a 
biased estimate ro.sof the ball radius ro.s such that a ball 
centered at the mean labelling with radius ro.s has 0.5 
posterior probability (the factor of follows from 
Chebyshev's inequality). We then compute the probabil- 
ity of the ball centered at the mean labelling with radius 
ro.5 (in case where there are multiple breakpoints, we 
take the maximum ro.5 across all estimates). For each 
breakpoint, we perform the optimization procedure 
described above. 

Choosing the ball radius adaptively 

The choice of r in the algorithm above may greatly 
influence the results, as we shall see in our experimental 
results. For breakpoints which occur in places where the 
two subtypes are highly similar, the uncertainty about 
breakpoint positions will be greater than in low-similar- 
ity regions. Thus, we may automatically adapt the ball 
size to the amount of uncertainty in a particular break- 
point. For a sequence with many breakpoints, the uncer- 
tainty about breakpoint positions may vary between 
breakpoints, so we alter our definition of balls to accom- 
odate this. 

An interval region R(f, [v^, Wi], . . . , [v^, w^]) for a 
footprint f = f\, ■ ■ ■ , fk and a set of intervals [vi, Wi], . . . 
< X^h ^k] is a set of labellings defined as follows: 

Rif, [vi, Wi], . . . , [Vk, Wf;]) = [X . X has footprint/, and all boundaries hi{X) 

are found in the intervals[y,, u),]} 

For each breakpoint, we require that the interval cho- 
sen for that breakpoint has overall probability of at least 
half of the probability if that breakpoint was free to 
occur anywhere in the sequence. In other words, if we 
start from the reference ball as our initial interval 
region, then the probability is decreased at most by half 
each time we update an interval. Subject to that require- 
ment, we identify the smallest possible interval for each 
breakpoint. This enables intervals for different break- 
points to have different width, thus accomodating differ- 
ent levels of uncertainty about the position of different 
breakpoints. 

The algorithm for finding small interval regions is 
analogous to the algorithm for finding most probable 
balls. As in the previous case, we optimize each interval 
separately after computing the probability of the break- 
point occuring at each position of interest. We find the 
optimal interval for a single breakpoint with a simple 
greedy linear-time algorithm: traverse the list of prob- 
abilities {pi, P2, — , Pn} and maintain a variable v. At step 
i, V denotes the maximal index such that Y^'j^ypj > 0.5 
(if no such index exists, we set v = 0). When the algo- 
rithm moves to the {i + l)-st step, v is increased if pos- 
sible. The minimum interval corresponds to the 



minimum value of i - v found during the execution of 
the algorithm. 

Reducing the number of active states by beam search 

Since our jpHMM has over 300, 000 states, running the 
standard forward and backward algorithms is computa- 
tionally infeasible. We speed up the runtime of the algo- 
rithm by noticing that, at any sequence position, most 
entries in the forward table contain values very close to 
zero. Ignoring those entries has little impact on predic- 
tions but can greatly reduce computation time. Beam 
search [21] limits the number of active states at each 
position, by ignoring every entry in the dynamic pro- 
gramming table whose value is below q?* where is 
the entry with the highest value and C is some prede- 
fined threshold. Following the decision made by the 
creators of jpHMM [7], we set C to 10'^". 

Results and Discussion 

Data used 

We have evaluated the performance of our local search 
algorithms on several semi-synthetic data sets and one 
real data set. All of the semi-synthetic data sets were 
produced by artificially recombining real viral sequences 
from different subtypes. We have also run our algo- 
rithms on real recombinants and compared their predic- 
tions to published breakpoints and to the predictions 
made by other algorithms. It is not possible to fully eval- 
uate our approach on real recombinants because their 
precise breakpoints are unknown, and human annota- 
tions are performed with the assistance of a variety of 
tools including ones we are comparing our work against 
[27]. By artificially recombining real sequences, we hope 
to create data sets that resemble real sequences as clo- 
sely as possible while still having known ground truth 
values. Our implementations are available at http:// 
www.cs.uwaterloo.ca/~jmtruszk/jphmm_balls.tar.gz. 

We compared our method to the standard implemen- 
tation of jpHMM decoding using the Viterbi algorithm, 
and to DualBrothers [13], which was chosen as a repre- 
sentative of phylogeny-based methods for recombination 
detection. We also tried to run ST-HMM [20], but it 
appears prohibitively slow for this task. DualBrothers 
takes as input a multiple alignment of subtype reference 
sequences and the query sequence and outputs a collec- 
tion of trees together with their posterior probabilities 
for each column of the alignment. We chose one 
sequence from each subtype to serve as reference. For 
each query sequence, we used MAFFT [28] to align it to 
the reference sequences. We then ran the MCMC sam- 
pler for 50000 steps, discarding the initial 25000 steps as 
burn-in. For each position in the query sequence, its 
subtype was assigned by taking the most probable 
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phylogeny at this position and finding the subtype that 
was topologically closest to the query sequence in that 
phylogeny. We broke ties arbitrarily. 

We used three data sets from Schultz et al. [8] which 
we call the 1000-1000, 500-1500 and 300-1500 data 
sets. In all data sets, each artificial sequence was cre- 
ated by taking two real sequences from different sub- 
types, aligning them and composing them into one 
sequence by alternating intervals from each parent 
sequence. For example, in the 300-1500 data set, the 
first interval consisted of roughly 300 initial positions 
from the first sequence, followed an interval of length 
1500 from the second sequence starting immediately 
after the base that was aligned to the last base from 
the first interval. The third interval had 300 positions 
taken from the first sequence and so on. The parental 
sequences were taken from the 6 most common HIV 
subtypes (A, B, C, D, F, G) and CRF 01_AE. Each of 
these three data sets has 40 sequences generated this 
way. 

In another data set from Schultz [29], with single 
breakpoints, each sequence consisted of two intervals 
taken from two parental sequences. The breakpoint 
locations were chosen uniformly at random. Here, par- 
ental sequences from all subtypes and sub-subtypes 
were used. This data set contains 106 sequences. In a 
second set of experiments, we tested the ability of the 
Viterbi algorithm and our ball algorithms to identify 
short inserted regions in the sequence that come from a 
different subtype than the rest of the sequence. If a 
region is too short, it will not be detected by any algo- 
rithm since the model will favour paths that do not 
make two low-probability jumps into and out of a differ- 
ent subtype. Our ball algorithms should be more cap- 
able of identifying such short regions. To test this 
hypothesis, we generated sequences with two break- 
points, with a fixed distance between the breakpoints. 
Each sequence was a recombinant of two out of eleven 
most common subtypes, with one sequence for every 
choice of the two subtypes. We generated four data sets 
this way for different insertion lengths (see Additional 
file 1). 

In the final set of experiments, we run all the algo- 
rithms on 20 real CRFs taken from the Los Alamos HIV 
sequence database [27]. We used the reference strains 
for CRFs 2 to 9, 11 to 21, and CRF 23. When comparing 
the predictions to the published breakpoints, we dis- 
carded the 5 CRFs whose annotation had been produced 
with the aid of jpHMM for a more fair comparison; we 
also discarded all CRFs whose annotation contained 
intervals labelled as ambiguous or unknown. This pro- 
duced a set of 9 CRFs. When comparing the predictions 
of various algorithms, we used the whole data set of 20 
CRFs. 



We used two accuracy measures to evaluate the algo- 
rithms. For each true breakpoint recovered by an algo- 
rithm, we calculated the distance between the predicted 
and actual (or reported) breakpoint positions. We report 
the median of these distances for each experiment. In 
addition, we report the breakpoint sensitivity, defined as 
the fraction of the true breakpoints that were identified 
by an algorithm. We consider a true breakpoint identi- 
fied if a prediction contains a breakpoint within 300 
positions of the true breakpoint and the subtypes on 
both sides of the predicted breakpoint match the true 
subtypes. If the prediction contains a breakpoint that 
does not correspond to a true breakpoint in the above 
way, we report it as a false positive. 

Overall results 

Table 1 shows the median breakpoint error for the first 
set of experiments. For r = 5 and r = 7, our algorithms 
outperformed Viterbi predictions in terms of the median 
distance between predicted and actual breakpoints. The 
adaptive region algorithm gave similar results, but it did 
not offer any further improvement over the ball algo- 
rithms for the above parameter choices. DualBrothers 
had much higher median distance than all algorithms 
based on the jpHMM. 

We used the Wilcoxon signed rank test to compare 
the distribution of breakpoint errors between Viterbi 
and ball algorithms. For r = 5, the jj-values for the four 
data sets in Table 1 are 4 -10'^°, 3 -10 ^ 3 -10 ^ and 3 
•10'^*, respectively. 

The HMM algorithms all managed to reconstruct the 
vast majority of breakpoints, while DualBrothers did 
not. Our ball algorithms noticeably outperformed 
Viterbi on the 300-1500 data set. For the other data 
sets, the differences were negligible. Table 2 shows the 
sensitivities for all experiments. In this set of experi- 
ments, no false positives were found for Viterbi. Our 
algorithms produced no false positives on all data sets 



Table 1 Comparison of different algorithms 



Method 




Single 
breakpoint 


1000- 
1000 


500- 
1500 


300-1 500 


Viterbi 




7 (3..17) 


6 (2..17) 


7 (2..20) 


7 (3..16) 


max ball, r = 


1 


6 (2..16) 


6 (2..18) 


7 (1..19) 


7 (Z.16) 


max ball, r = 


2 


5 (2..16) 


6 (1..17) 


6 (2.. 18) 


6 (Z.15) 


max ball, r = 


5 


4 (1..12) 


5 (3..14) 


5 (3..15) 


5 (2..13) 


max ball, r = 


7 


4 (2..12) 


5 (3..14) 


6 (3.. 14) 


6 (3..12) 


max ball, r = '• 


0 


6 (3..10) 


7 (3..14) 


8 (3..14) 


7 (3..10) 


max ball, r = ' 


5 


7 (3..11) 


9 (4.. 14) 


10 (5.. 15) 


7 (4.12) 


min region 




4 (1..13) 


5 (2..14) 


6 (2.. 16) 


5 (2..14) 


DualBrothers 


57(45.71) 


51 (42..67) 


48(37..54) 


45 
(40..51) 



Shown are the median and interquartile range of differences between 
predicted and actual breakpoints for the four initial experiments. 
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Table 2 Number of breakpoints recovered 



Method 


Single 
breakpoint 


1000- 
1000 


500- 
1500 


300- 
1500 


Viterbi 


105/106 


359/360 


327/360 


344/360 


max ball, r = 5 


105/106 


360/360 


326/360 


350/360 


max ball, r = 7 


105/106 


360/360 


326/360 


350/360 


max ball, r = ]0 


105/106 


360/360 


326/360 


350/360 


max ball, r=]5 


105/106 


360/360 


326/360 


350/360 


min region 


105/106 


360/360 


326/360 


350/360 


DualBrothers 


76/1 06 


146/360 


81/360 


27/360 



Shown are the fractions of recovered breakpoints for each experiment. 



except the 500-1500 data set, where each ball algorithm 
produced 2 false-positive breakpoints. DualBrothers 
recovered less than half of the breakpoints on each data 
set except the first and also gave a considerable number 
of false positives (120,122,130 and 104 for single break- 
point, 1000-1000, 500-1500 and 300-1500 data sets, 
respectively). DualBrothers tended to make the most 
mistakes on sequences derived from subtype A, often 
confusing it with CRF Ol. 

In the second set of experiments, we see that the 
accuracy of both Viterbi and ball algorithms deteriorates 
gradually as interval length decreases. Ball algorithms 
managed to identify more correct breakpoints for all 
interval lengths. For all algorithms, the number of false- 
positive breakpoints increases as intervals get smaller, 
with ball algorithms yielding slightly higher false-positive 
rates than Viterbi. The results are shown in Table 3 and 
Table 4. Since the choice of ball radius does not affect 
the number of correctly identified breakpoints, we only 
report results for r = 5. Among the correctly predicted 
breakpoints, the median error rates for ball algorithms 
were smaller than for Viterbi similarly to the first set of 
experiments(results not shown). 

For the 9 real recombinants whose published break- 
points had not been inferred using jpHMM, the ball 
algorithm and Viterbi generally gave very similar results 
(see Table 5). Both algorithms predicted roughly half of 
the published breakpoints; the balls algorithm predicted 
two more than did Viterbi. Somewhat surprisingly, 
Viterbi had slightly lower median breakpoint distance to 
the published breakpoints, though this result is not sig- 
nificant according to the Wilcoxon signed-rank test {p = 
0.42). DualBrothers recovered fewer of the published 



Table 3 True positives on short intervals 



Method 


250 


200 


150 


100 


Viterbi 


96/1 1 0 


79/110 


45/1 1 0 


25/110 


max ball, r = 5 


100/110 


85/110 


58/110 


29/1 1 0 



Shown are the fractions of recovered breakpoints for different interval 
lengths. 



Table 4 False positives on short Intervals 



IVlethod 


250 


200 


150 


100 


Viterbi 


2/110 


4/110 


8/110 


10/110 


max ball, r = 5 


2/110 


4/110 


11/110 


14/110 



Shown are the fraction of false positive breakpoints for different interval 
lengths. 



breakpoints and had substantially higher median break- 
point distance to the published breakpoints that it 
recovered than the other two algorithms. Figure 5 shows 
predictions of all the algorithms for one of the published 
CRFs. 

There was not much agreement between DualBrothers 
and the other two programs on all real recombinants. 
Only 38 out of 150 breakpoints identified by DualBr- 
others for all 20 recombinants were also identified by 
jpHMM. In contrast, 136 out of 146 breakpoints identi- 
fied by the ball algorithm were identified by jpHMM. 
The median distance between corresponding break- 
points was 61 between DualBrothers and jpHMM, com- 
pared to 4 between balls and jpHMM. This is not 
surprising since the ball algorithm uses the same model 
parameters as jpHMM. 

The balls found by our ball algorithm often contain 
most of the probability mass given the model and the 
sequence. Out of the 106 sequences in the single-break- 
point dataset, 81 of the balls found by our algorithm 
had over 0.5 posterior probability, for r = 5. 

Our prototype C++ implementation of ball algorithms 
processed the 1000-1000 dataset in about 15.5 hours on 
a standard desktop computer. The approximate time to 
analyze a sequence (average length 9256) is 23 minutes 
with our ball algorithms, which compares well with the 
17.5 minutes required to use the simpler Viterbi algo- 
rithm, in our prototype. Schultz et al. [7] achieve a sub- 
stantial speedup, to 6.7 minutes per sequence, using 
their implementation of Viterbi, but we expect that opti- 
mizing our methods in this domain will require only 
small overhead, compared to their jpHMM-HIV pack- 
age. The running time of DualBrothers was about 120 
minutes per sequence. While a faster implementation of 
DualBrothers exists [14], we were not able to compile it 
on our system. 

Error analysis 

The strikingly lower sensitivity in all experiments on the 
500-1500 data set is due to an artifact of the data set 
rather than flaws in the algorithms. In 33 sequences from 
that data set, some of the true breakpoints were located 
after the last consensus column of the multiple sequence 
alignment of HIV sequences. The bases in that region are 
modelled by an additional insert state emitting letters 
according to a background frequency, so the model cannot 
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Table 5 Comparison with published breakpoints 



Method 


predicted 


not predicted 


extra breakpoints 


distance 


Viterbi 


39 


33 


5 


1 7(4..54) 


max ball, r = 5 


41 


31 


5 


21(6..53) 


DualBrothers 


22 


50 


19 


65(37..82) 



For each algorithm, we report the number of published breakpoints that have been predicted by the algorithm and the number of predicted breakpoints that do 
not correspond to any of the published breakpoints, as well as the median distance between published and predicted breakpoints. 



distinguish between any two subtypes in that region. Con- 
sequently, no algorithm that uses jpHMM will be able to 
predict any recombination breakpoints in the unconserved 
region at the end of the sequence. 

The higher median breakpoint error for ball algorithms 
with larger radii is because in many cases, most of the 
probability mass is concentrated in a small fragment of the 
most probable ball which is located away from the centre. 
Moving away from the concentration region might 
increase the ball probability by a very small amount, which 



causes the algorithm to pick a ball whose centre is rela- 
tively far from the region where the breakpoint is likely. 
On the other hand, if the ball radius is too small (e.g. r = 
2) the algorithm often picks small regions of high prob- 
ability and ignores many other likely breakpoint locations, 
which causes errors similar to those made by Viterbi. 

Conclusions 

We have presented two novel algorithms for predicting 
recombinations in viral genomes. The ball algorithm 
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Figure 5 Predicted breakpoints for CRF07_BC. The published recombinant structure is shown on the top picture, followed by predictions 
from the ball algorithm, Viterbi, and DualBrothers. None of the programs recovered all of the published breakpoints. The balls algorithm did not 
include 2 of the published breakpoints. Viterbi missed 4 published breakpoints, whereas DualBrothers missed 6. Recombinants were drawn using 
the HIV recombinant drawing tool [30]. 
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identifies recombination breakpoints up to a user- 
defined accuracy, while the interval region algorithm 
attempts to choose the accuracy adaptively based on the 
observed uncertainty of breakpoint locations. Both algo- 
rithms account for the inherent uncertainty about pre- 
cise breakpoint locations. This leads to an improvement 
over the Viterbi algorithm for jpHMM in both identify- 
ing the correct breakpoints and their approximate loca- 
tions. Our algorithms also show improvement on 
difficult instances of the problem where distances 
between breakpoints are small. 

Both algorithms have similar accuracies in our experi- 
ments. Contrary to our expectations, the interval region 
algorithm does not perform better than the ball algo- 
rithm with suitable parameter choices. The interval 
region algorithm may still be preferred in situations 
where there is a lot of uncertainty about some break- 
point positions but little uncertainty about others. 

Additional material 



Additional file 1: Data sets. This file contains sequences and their 
annotations for the single-breakpoint and short interval data sets. 
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